Discriminating Similar Languages: Persian and Dari

Author

  • Shervin Malmasi
Abstract

Although widely studied in recent years, Language Identification (LID) systems, which determine the language of input texts, often fail to discriminate between similar languages like Croatian-Serbian and Malay-Indonesian. This has brought attention to the task of discriminating similar languages, varieties and dialects, including a recent shared task [3]. Persian (also known as Farsi) and Dari (Eastern Persian, spoken predominantly in Afghanistan) are two close variants that have not previously been investigated in LID, and we report the first results on this pair. Dari is a low-resourced but important language, particularly for the U.S. due to its ongoing involvement in Afghanistan, which has led to increasing research interest [1]. We developed a corpus of 28k sentences (14k per language) and, using character and word n-grams, discriminated the two languages with 96% accuracy. An out-of-domain, cross-corpus evaluation tested the generalizability of the discriminative models, which achieved 87% accuracy in classifying 79k sentences from the Uppsala Persian Corpus. Feature analysis revealed lexical, morphological and orthographic differences between the two languages. Beyond determining document languages, LID has applications in character encoding detection, statistical machine translation, inducing dialect-to-dialect lexicons and authorship profiling in the forensic linguistics domain. In Information Retrieval it can help filter documents (e.g. news articles or search results) by dialect. LID can also be used in other Natural Language Processing tasks, including the adaptation of tools like part-of-speech taggers for low-resourced languages [2]. Since Dari differs too much from Persian for Persian resources to be applied directly, the distinguishing features identified through LID can assist in adapting existing resources.

Body: Language Identification methods using surface features can distinguish the close linguistic variants Persian and Dari with 96% accuracy.
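To make the idea of classifying with character and word n-grams concrete, here is a minimal sketch in Python. The abstract does not name the classifier or the n-gram orders used, so the multinomial Naive Bayes model, the n-gram ranges, and all names below (`char_ngrams`, `word_ngrams`, `NaiveBayesLID`) are assumptions chosen for illustration. The training strings are toy placeholders for two classes "A" and "B", not real Persian or Dari data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Character n-grams of length n_min..n_max, with space padding at the edges."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def word_ngrams(text, n_max=2):
    """Word n-grams (unigrams up to n_max-grams) over whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def features(text):
    # Prefixes keep character and word features in separate namespaces.
    return (["c:" + g for g in char_ngrams(text)]
            + ["w:" + g for g in word_ngrams(text)])


class NaiveBayesLID:
    """Hypothetical multinomial Naive Bayes classifier with add-one smoothing."""

    def fit(self, sentences, labels):
        self.feature_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.feature_counts[label].update(features(sentence))
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def predict(self, sentence):
        n_sentences = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each observed feature.
            score = math.log(n_label / n_sentences)
            for feat in features(sentence):
                score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data: two artificial "languages", not real Persian/Dari text.
train_sentences = ["aaa bab aba", "abab aaa ba", "zzz yzy zyz", "yzyz zzz zy"]
train_labels = ["A", "A", "B", "B"]
model = NaiveBayesLID().fit(train_sentences, train_labels)

print(model.predict("aba aaa"))
print(model.predict("zyz zzz"))
```

In a real setting the same feature functions would be applied to labelled sentences from each variety, and accuracy would be estimated on held-out data, including an out-of-domain corpus as the abstract describes.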


Similar resources

Persian, Dari and Tajik in Central Asia

Individual researchers retain the copyright on their work products derived from research funded through a contract or grant from the National Council for Eurasian and East European Research (NCEEER). However, the NCEEER and the United States Government have the right to duplicate and disseminate, in written and electronic form, reports submitted to NCEEER to fulfill Contract or Grant Agreements...


The Inflectional “-y” at the End of Imperative Verb in Middle (Dari) Persian

In Early Modern Persian prose and verse, verbs with a present stem accompanied by grapheme "-y" have been used to express modal concepts of imperative and command or invocation and request. Researchers, regardless of the historical changes of Early Modern Persian, believe that this structure of subjunctive 2nd person singular has been used to express imperative mood, and that ...


ASCII Based Transcription Systems for Languages with the Arabic Script: The Case of Persian

In this paper, we discuss transcription systems needed for automated spoken language processing applications in languages such as Persian that use the Arabic script for writing. The work is described in the context of a speech-to-speech translation system development for English and Persian. This system can easily be modified for Arabic, Dari, Urdu and any other language that uses the Arabic sc...


History of Medicine in Iran: The Oldest Known Medical Treatise in the Persian Language

In this article, we describe some features of a rediscovered medical text written in old Persian (Farsi Dari) over one thousand years ago and discuss some of its significant attributes in relation to the historical background of the Iranian scientific and literary renaissance of that era.


A Transcription Scheme for Languages Employing the Arabic Script Motivated by Speech Processing Application

This paper offers a transcription system for Persian, the target language in the Transonics project, a speech-to-speech translation system developed as a part of the DARPA Babylon program (The DARPA Babylon Program; Narayanan, 2003). In this paper, we discuss transcription systems needed for automated spoken language processing applications in Persian that uses the Arabic script for writing. Th...


Journal:
  • TinyToCS

Volume 3

Publication date: 2015